Speech coding

Speech coding is an application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation based on audio signal processing techniques to model the speech signal, combined with generic data compression algorithms to represent the resulting modeled parameters in a compact bitstream.

Common applications of speech coding are mobile telephony and voice over IP (VoIP).M. Arjona Ramírez and M. Minami, "Technology and standards for low-bit-rate vocoding methods," in The Handbook of Computer Networks, H. Bidgoli, Ed., New York: Wiley, 2011, vol. 2, pp. 447–467. The most widely used speech coding technique in mobile telephony is linear predictive coding (LPC), while the most widely used in VoIP applications are the LPC and modified discrete cosine transform (MDCT) techniques.

The techniques employed in speech coding are similar to those used in audio data compression and audio coding, where knowledge of psychoacoustics is used to transmit only data that is relevant to the human auditory system. For example, in speech coding, only information in the frequency band 400 to 3500 Hz is transmitted, but the reconstructed signal retains adequate intelligibility.
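As a rough illustration of this band-limiting idea, the Python sketch below restricts a signal to roughly the 400-3500 Hz telephone band with an off-the-shelf Butterworth filter; it is not part of any standardized coder, and the sampling rate, filter order, and test signal are arbitrary choices for the example.

  import numpy as np
  from scipy.signal import butter, sosfiltfilt

  def telephone_band(signal, sample_rate):
      """Keep only roughly the 400-3500 Hz band that narrowband telephony transmits."""
      sos = butter(4, [400.0, 3500.0], btype="bandpass", fs=sample_rate, output="sos")
      return sosfiltfilt(sos, signal)

  fs = 16000                       # arbitrary sampling rate for the demo
  t = np.arange(fs) / fs           # one second of samples
  # Toy signal with components inside (1 kHz) and outside (150 Hz, 6 kHz) the band.
  x = (np.sin(2 * np.pi * 150 * t)
       + np.sin(2 * np.pi * 1000 * t)
       + np.sin(2 * np.pi * 6000 * t))
  y = telephone_band(x, fs)        # the 150 Hz and 6 kHz components are strongly attenuated
  print(np.max(np.abs(y)))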

Speech coding differs from other forms of audio coding in that speech is a simpler signal than other audio signals, and statistical information is available about the properties of speech. As a result, some auditory information that is relevant in general audio coding can be unnecessary in the speech coding context. Speech coding stresses the preservation of intelligibility and pleasantness of speech while using a constrained amount of transmitted data.P. Kroon, "Evaluation of speech coders," in Speech Coding and Synthesis, W. Bastiaan Kleijn and K. K. Paliwal, Eds., Amsterdam: Elsevier Science, 1995, pp. 467–494. In addition, most speech applications require low coding delay, as latency interferes with speech interaction.J. H. Chen, R. V. Cox, Y.-C. Lin, N. S. Jayant, and M. J. Melchner, "A low-delay CELP coder for the CCITT 16 kb/s speech coding standard," IEEE J. Select. Areas Commun. 10(5): 830–849, June 1992.


Categories
Speech coders are of two classes:
  1. Waveform coders, which aim to reproduce the speech waveform itself (e.g. PCM and ADPCM)
  2. Vocoders, which transmit a compact parametric model of the speech:
    • Linear predictive coding (LPC)
    • Formant coding
    • Machine learning, i.e. neural vocoders


Sample companding viewed as a form of speech coding
The A-law and μ-law algorithms used in G.711 PCM digital telephony can be seen as early precursors of speech encoding, requiring only 8 bits per sample but giving effectively 12 bits of resolution. Logarithmic companding is consistent with human hearing perception in that low-amplitude noise is heard alongside a low-amplitude speech signal but is masked by a high-amplitude one. Although this would generate unacceptable distortion in a music signal, the peaky nature of speech waveforms, combined with the simple frequency structure of speech as a periodic waveform having a single fundamental frequency with occasional added noise bursts, makes these very simple instantaneous compression algorithms acceptable for speech.
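The core of μ-law companding can be sketched in a few lines of Python. Note that G.711 itself uses a piecewise-linear segment approximation of this curve, so the continuous formula below is an illustration rather than a conformant implementation:

  import numpy as np

  MU = 255.0  # mu-law constant used in North American and Japanese telephony

  def mu_law_compress(x):
      """Logarithmically compand samples in [-1, 1]."""
      return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

  def mu_law_expand(y):
      """Invert mu_law_compress."""
      return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

  def encode_8bit(x):
      """Compand, then quantize to 8 bits per sample, as in G.711-style telephony."""
      y = mu_law_compress(np.clip(x, -1.0, 1.0))
      return np.round((y + 1.0) * 127.5).astype(np.uint8)

  def decode_8bit(code):
      return mu_law_expand(code.astype(np.float64) / 127.5 - 1.0)

  # Small-amplitude samples come back with much finer resolution than
  # uniform 8-bit quantization would give them.
  x = np.array([-0.5, -0.01, 0.0, 0.001, 0.01, 0.5])
  print(decode_8bit(encode_8bit(x)))

The logarithmic curve spends most of the 256 code levels on low amplitudes, which is where speech signals spend most of their time.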

A wide variety of other algorithms were tried at the time, mostly delta modulation variants, but after careful consideration, the A-law/μ-law algorithms were chosen by the designers of the early digital telephony systems. At the time of their design, their 33% bandwidth reduction for very low complexity made them an excellent engineering compromise. Their audio performance remains acceptable, and there was no need to replace them in the stationary phone network.

In 2008, the G.711.1 codec, which has a scalable structure, was standardized by ITU-T. Its input sampling rate is 16 kHz.


Modern speech compression
Much of the later work in speech compression was motivated by military research into digital communications for secure military radios, where very low data rates were used to achieve effective operation in a hostile radio environment. At the same time, far more processing power was available, in the form of VLSI circuits, than had been available for earlier compression techniques. As a result, modern speech compression algorithms could use far more complex techniques than were available in the 1960s to achieve far higher compression ratios.

The most widely used speech coding algorithms are based on linear predictive coding (LPC). In particular, the most common speech coding scheme is the LPC-based code-excited linear prediction (CELP) coding, which is used for example in the GSM standard. In CELP, the modeling is divided into two stages: a linear predictive stage that models the spectral envelope and a codebook-based model of the residual of the linear predictive model. In CELP, linear prediction coefficients (LPC) are computed and quantized, usually as line spectral pairs (LSPs). In addition to the actual speech coding of the signal, it is often necessary to use channel coding for transmission, to avoid losses due to transmission errors. To get the best overall coding results, speech coding and channel coding methods are chosen in pairs, with the more important bits in the speech data stream protected by more robust channel coding.
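The linear predictive stage alone can be illustrated with a short Python sketch using the autocorrelation method and the Levinson-Durbin recursion; the codebook search, quantization to LSPs, and channel coding that a complete CELP coder performs are omitted here:

  import numpy as np

  def lpc_coefficients(frame, order=10):
      """Estimate the coefficients of A(z) = 1 + a1*z^-1 + ... + ap*z^-p for one
      speech frame, using the autocorrelation method and the Levinson-Durbin
      recursion. Returns (a, residual_energy)."""
      frame = frame * np.hamming(len(frame))         # analysis window
      r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                    for k in range(order + 1)])      # autocorrelation r[0..order]
      a = np.zeros(order + 1)
      a[0] = 1.0
      err = r[0] + 1e-12                             # floor avoids division by zero on silence
      for i in range(1, order + 1):
          acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
          k = -acc / err                             # reflection coefficient
          a[1:i] = a[1:i] + k * a[i - 1:0:-1]
          a[i] = k
          err *= 1.0 - k * k
      return a, err

  # Fit a 10th-order predictor to a 20 ms frame of a synthetic "voiced" signal.
  fs = 8000
  t = np.arange(int(0.02 * fs)) / fs
  frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
  a, e = lpc_coefficients(frame)
  print(a, e)

In a full CELP coder, these coefficients are quantized (typically as LSPs) and the remaining prediction residual is approximated by an excitation chosen from fixed and adaptive codebooks.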

The modified discrete cosine transform (MDCT) is used in the LD-MDCT technique of the AAC-LD format introduced in 1999. MDCT has since been widely adopted in voice-over-IP (VoIP) applications, such as the G.729.1 codec introduced in 2006, Apple's FaceTime (using AAC-LD) introduced in 2010, and the CELT codec introduced in 2011. Presentation of the CELT codec by Timothy B. Terriberry (65 minutes of video, see also presentation slides in PDF)
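For reference, the forward MDCT maps a block of 2N overlapping samples to N coefficients. A direct, unoptimized Python sketch under a sine window is shown below; production codecs use FFT-based fast algorithms and carefully designed window shapes, so this only illustrates the transform itself:

  import numpy as np

  def mdct(block):
      """Forward MDCT of one block of 2N samples (50% overlap with its
      neighbours), returning N coefficients. Direct O(N^2) evaluation."""
      two_n = len(block)
      n = two_n // 2
      ns = np.arange(two_n)
      ks = np.arange(n)
      window = np.sin(np.pi / two_n * (ns + 0.5))    # satisfies the Princen-Bradley condition
      basis = np.cos(np.pi / n * (ns[None, :] + 0.5 + n / 2.0) * (ks[:, None] + 0.5))
      return basis @ (window * block)

  coeffs = mdct(np.random.randn(512))   # 512 input samples -> 256 MDCT coefficients
  print(coeffs.shape)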

Opus is a free audio coder. It combines the speech-oriented LPC-based SILK algorithm and the lower-latency MDCT-based CELT algorithm, switching between or combining them as needed for maximal efficiency. It is widely used for VoIP calls in WhatsApp.

The PlayStation 4 video game console also uses Opus for its PlayStation Network system party chat.

A number of codecs with even lower bit rates have been demonstrated. Codec2, which operates at bit rates as low as 450 bit/s, sees use in amateur radio. NATO currently uses MELPe, offering intelligible speech at 600 bit/s and below.Alan McCree, "A scalable phonetic vocoder framework using joint predictive vector quantization of MELP parameters," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2006, pp. I 705–708, Toulouse, France. Neural vocoder approaches have also emerged: Lyra by Google gives an "almost eerie" quality at 3 kbit/s. Microsoft's Satin also uses machine learning, but uses a higher tunable bitrate and is wideband.


Sub-fields
Wideband audio coding
  • Linear predictive coding (LPC)
    • Adaptive Multi-Rate Wideband (AMR-WB) for WCDMA networks
    • Variable-rate multimode wideband (VMR-WB) for CDMA2000 networks
    • Speex, IP-MR, SILK (part of Opus), and USAC/xHE-AAC for VoIP and videoconferencing
  • Modified discrete cosine transform (MDCT)
    • AAC-LD, G.722.1, G.729.1, and Opus for VoIP and videoconferencing
  • Adaptive differential pulse-code modulation (ADPCM)
    • G.722 for VoIP
  • Neural speech coding
    • Lyra (Google): V1 uses neural network reconstruction of a log-mel spectrogram (a sketch of such features follows this list); V2 is an end-to-end neural codec.
    • Satin (Microsoft)
    • LPCNet (Mozilla, Xiph): neural network reconstruction of LPC features
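To make the feature-reconstruction entries above concrete, the numpy-only sketch below computes a log-mel spectrogram, the kind of compact spectral representation that Lyra V1 reconstructs speech from. The frame size, hop, FFT length, and number of mel bands here are illustrative defaults, not the parameters of any particular codec:

  import numpy as np

  def log_mel_spectrogram(signal, sample_rate=16000, frame_len=400, hop=160,
                          n_fft=512, n_mels=40, fmin=0.0, fmax=8000.0):
      """Log-mel spectrogram of a mono signal; returns shape (num_frames, n_mels)."""
      hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
      mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

      # Triangular mel filterbank over the rfft bins.
      mel_points = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
      bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
      fbank = np.zeros((n_mels, n_fft // 2 + 1))
      for m in range(1, n_mels + 1):
          lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
          for k in range(lo, ctr):
              fbank[m - 1, k] = (k - lo) / max(ctr - lo, 1)
          for k in range(ctr, hi):
              fbank[m - 1, k] = (hi - k) / max(hi - ctr, 1)

      # Frame the signal, window each frame, and take its power spectrum.
      frames = []
      for start in range(0, len(signal) - frame_len + 1, hop):
          windowed = signal[start:start + frame_len] * np.hanning(frame_len)
          frames.append(np.abs(np.fft.rfft(windowed, n_fft)) ** 2)
      power = np.array(frames)
      return np.log(power @ fbank.T + 1e-10)

  features = log_mel_spectrogram(np.random.randn(16000))   # one second of noise
  print(features.shape)                                     # (num_frames, 40)

A neural vocoder is then trained to generate a speech waveform conditioned on such a feature sequence.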

Narrowband audio coding


See also
  • Digital signal processing
  • Speech interface guideline
  • Speech processing
  • Vector quantization

